fix(video): fix video stream freezing on capture re-init (e.g. pipewire display switch)#5249
Conversation
b29d83a to
d633d52
Compare
bb8e6da to
1003b5d
Compare
1003b5d to
f8bb8ae
Compare
Bundle ReportBundle size has no change ✅ |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5249 +/- ##
==========================================
- Coverage 27.55% 27.54% -0.01%
==========================================
Files 113 113
Lines 25589 25596 +7
Branches 11238 11239 +1
==========================================
+ Hits 7050 7051 +1
+ Misses 16593 15284 -1309
- Partials 1946 3261 +1315
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 64 files with indirect coverage changes Continue to review full report in Codecov by Harness.
|
|
Does this resolve #4943 then? And, do you want to merge this or wait for tester feedback (it's unclear from the PR description)? |
|
It's the fix for the second issue that I've identified in #4943 after commenting out the frame flush (still waiting on feedback in #4943 for that). The video stream freeze issue can occur independently although I'm unsure about the root cause as it's some kind of race condition with an extremely tight timing. It would probably be good to have someone else look at this if possible but I'd also rather not delay this too long in case it also helps with #5241. Maybe @psyke83 could check this if his time permits? By itself this shouldn't break anything as it is only adding 100us to the display switch capture loop return (and only to that specific case). I can also add the removal of the frame flush to this PR depending on the final verdict in #4943 if you want to wait for that. |
f4d7ea3 to
db9699b
Compare
|
To speed this up I've removed the logic order change (as that's the thing I was really unsure about) for the capture timeout case (as it is working as it is now). If someone ever finds and fixes the root cause or has a better solution to avoid the video stream freeze the workaround can/should be replaced/removed. The way it is now I'm good with merging the change. Still needs a fix for the FFmpeg segfault to fully handle #4943 for that see #5257, otherwise it needs to be fixed in upstream FFmpeg. |
|
You mention that the issue can't be reproduced when Sunshine debug is active, but have you tried checking (only) the pipewire debug output for any clues when this happens? Adding PIPEWIRE_DEBUG=4 into Sunshine's runtime environment should do the trick. |
|
Pipewire debug output (although it is a lot) shows the following right before the stream freezes: If I'm reading this correctly then pipewire should be streaming already (first line) but whatever happens in Sunshine's video capture loop gets stuck without that tiny delay (Audio still plays, further display switching breaks likely due to As already noted I've not found a way to properly debug this yet as even the debugger (or debug logging) throws the timing off enough to not trigger the issue. With this PR applied I'm getting the same Pipewire debug output but this time the encoder properly creates and starts up: The only difference on pipewire debug output I've seen once after was (and that only because I've been button mashing CTRL+ALT+SHIFT+F1 as hard as possible): The freeze only happens on portal-/kwingrab, not on kmsgrab and I've also tested both vulkan and vaapi encoders to make sure it's not related to them. I'm not fully satisfied with adding a microsleep to workaround this issue but so far it's the best thing I could come up with and it's the only thing that has proven to reliably work. Therefore I've marked this as a workaround in code until a better solution is found. |
|
Can you see if the issue can reproduce with active content on the host? Leave vrrtest running, for example. When troubleshooting mode change hangs there was one specific case where a hang occurred only when changing modes on an idle desktop. |
|
Also happens on non-idle stream. Tested with running a YouTube video. That's where I noted audio continuing despite the video stream freeze. |
|
To clarify: The issue does not occur on every display switch but has a noticeable chance to occur (like 3 in 10 switches). It can also be force by button mashing any display switch button repeatedly (Note: you might trigger #4943 as well without the proper fix in place). |
|
After looking into this I've unfortunately still not come up with a better solution so far. Due to the workaround functioning the way it is implemented hints at the codepath for |
db9699b to
0931ebc
Compare
0931ebc to
4ac5bf0
Compare
e4d1455 to
be6f953
Compare
|
@psyke83 after doing more digging I've noted we're stopping the pw_thread_loop twice. Could you check if 5de0ca7 is the proper fix for this? I'm still seeing hangs when switching without the added delay unfortunately. I've also found out that having more that one clients connected adds enough delay to throw this racecondition off. I've reduced the workaround delay by another magnitude (to 10us) and still cannot get hangs that way (at least so far) but can force them easily without an added delay. |
be6f953 to
37b59df
Compare
|
I've finally managed to get a debugger attached when the issue has already occurred and the Lines 42 to 60 in 8a0525a EDIT: Could also be that the while-loop is not exiting here as The next (and only remaining) function in the call stack for the capture thread is Lines 1450 to 1469 in 8a0525a Meanwhile the Lines 2489 to 2494 in 8a0525a |
eb8607d to
71dd734
Compare
|
Unfortunately, doesn't work so I've reverted this back to the sleep workaround. @psyke83 I'd be grateful if you could have a look at the debugger results if your time permits? Hopefully you have a deeper understanding of what's really going on (or rather wrong) here. |
71dd734 to
a0c106d
Compare
|
Digging into this some more I've noticed that
With @psyke83 Since this was your change could you advise the best course of action? I've tried a fix with EDIT: Added EDIT2: ac5b4e9 is unfortunately not enough to fix this as I can still trigger the issue (although not as often as before) EDIT3: Trying with fully reverted EDIT4: Issue is still occurring with fully reverted |
b212fa3 to
a0c106d
Compare
|
I'm trying to get caught up; your proposed fix for the crash in #5257 works for me, but only for VAAPI; Vulkan still crashes. So to test this PR I'm keeping #5257 applied, but sticking to VAAPI and kwingrab. Additionally, to eliminate variables during testing, I'm keeping a glxgears window running that's intersecting on both screens to rule out issues with pipewire genuinely not sending buffers due to an idle screen. I'm also making sure to leave the systray enabled, as I recall that was causing odd issues with previous bugs related to pipewire capture. With the above noted, I'm able to reproduce the freezing which is definitely ameliorated with this PR. If I revert a0c106d but keep the pw_thread_loop commit, I can once again reproduce the freezing, but unlike what you've reported, it does seem to resolve the freezing if I revert my recent changes to pop() in thread_safe.h. This is not related to this PR directly, but Vulkan seems to be unstable with both display and mode switching on my system on master branch, with both of your PRs, or even with the thread_safe.h changes reverted. I wrote a stress script for KDE: With a timeout smaller than 5 this almost immediately triggers a crash with Vulkan, but VAAPI survives 100 cycles even with 0 sleep. I would suggest checking to see if mode changes cause crashing in Vulkan on your system; if so, Vulkan may need separate fixes that are interfering with your testing of display switching freezes in this PR and crashes in the other PR. |
|
@psyke83 Thanks for looking into this and also for working on/creating a reliable reproducer. I probably shouldn't have put 4 edits in my last comment (didn't want to make too many separate small comments) but I've also realized that reverting thread_safe.h does indeed not fix this issue. I've just not managed to trigger it when writing my initial comment. The freeze however is related to an indefinite wait in Lines 116 to 118 in 0731729 But logically it would make more sense for the whole free images while block to be atomically locked (so a peek() being true will guarantee pop() has something to pop at all) but this might entail a new function in thread_safe.h to implement properly: Lines 1464 to 1466 in 0731729 As for the vulkan crash I've managed to hit that once so far (after trying to add lock_guard to EDIT: I've looked through my logs and can see that the vulkan crash might be a double-free issue due to a race condition as I'm seeing the following on top of the related coredump: EDIT2: For the vulkan crash it's in |
|
Lines 1464 to 1466 in 0731729 Looking at this section more, it might make sense to just use a pop with a delay/timeout here as it should still keep going while peek is true but should never wait infinitely. |
|
I think you've identified the source of the freeze, but isn't that also papering over the issue with a timeout? There seems to be a deadlock in the unbounded peek() -> pop() calls that's specific to pipewire, and seems likelier to occur with my changes to the unbounded pop() function (but you noted that reverting it didn't fix it on your system). The peek() -> pop() cycle may be causing issues due to multiple callers/consumers that causes a peek to return true but the pop() is already empty (perhaps due to the async capture queue?). Although the peek() -> pop() call style is present in multiple parts of the code, can you try this diff focused specific on the problematic part? (full diff vs this PR + #5257): I can't reproduce any freeze with this change, but I'm unsure if there's a more appropriate fix to the issue. If this looks like the right track, we might want to evaluate other instances of unbounded peek() -> pop() calls that may also be susceptible to deadlocking. A quick grep makes me wonder if stream.cpp might benefit from this type of change as well. Edit: to clarify, I tried adding locking to the peek() function before trying the "try_pop()" strategy, but it didn't seem to help resolve the deadlock. I am curious why you were still seeing deadlocks with the reverted pop() function but I wasn't; perhaps I didn't test enough to reproduce the issue. |
What I've been suggesting with my earlier comment was doing that part like this (I've tried this and it seems to work): diff --git a/src/video.cpp b/src/video.cpp
index ae7050ab..5500fc06 100644
--- a/src/video.cpp
+++ b/src/video.cpp
@@ -1462,7 +1462,7 @@ namespace video {
}
while (capture_ctx->images->peek()) {
- capture_ctx->images->pop();
+ capture_ctx->images->pop(20ms);
}
++capture_ctx;This does logically the same thing as what you're doing with As for the other codeparts that could potentially be deadlocking I'm unsure whether those should be touched. The
I've had that happen as well when adding a lock to |
|
Yes, the It seems reasonable to assume that one of these loops explains #5241; in fact, if we assume that a reinit was triggered at the time of those freezes, then this exact call site is also the exact culprit. Needless to say, this change didn't improve the display or mode switch crashing for Vulkan on my system, so you're likely correct that it's an ffmpeg bug. I don't recall it being so unstable during mode changes when I last tested some weeks/months ago. |
Agreed, for queue draining that exact deadlock scenario is the problem. It's especially critical in this specific case because it's draining the captured images from the capture context, which when being stuck on
That assumption is also reasonable as the
I've only been able to trigger this issue once so far (on fully updated CachyOS with 9700XT) but that race condition might be dependent on multiple factors (FFmpeg/Mesa/Hardware) and should be debugged further in a separate issue/PR. |
823ce83 to
e787fbc
Compare
e787fbc to
9c44762
Compare
|
@Kishi85 is this ready to merge? Edit: I recently made a change forcing more doxygen documentation. |
Before it was stopped at the beginning and end of the destructor. Doing it once at the end is what makes sense logically.
This is a trial-and-error workaround to avoid video stream freezing when display switching that somehow happens when not using the added thread sleeps due to a racecondition that needs to be further diagnosed. Pipewire state change logging is also move to information log level to improve initial issue analysis without enabling debug logging
…in video.cpp Also remove sleep workaround from pipewire.cpp Co-Authored-By: psyke83 <psyke83@users.noreply.github.com>
9c44762 to
1c97c44
Compare
Yes, it's looking good to me and @psyke83 . Also been testing this for the last day without any issues. For the remaining codeparts that have deadlock potential I'll open an issue so checking those will not be forgotten. @psyke83 Would you open an issue (or PR) for the vulkan crash? I'm not experiencing this reliably enough to provide the necessary details for that.
Added, always good to have more documentation (where appropriate). |
|
Note: This will just fix one issue that can occur when display switching (especially with event-based methods like pipewire). Other still known issues:
|
|
…re display switch) (LizardByte#5249) Co-authored-by: psyke83 <psyke83@users.noreply.github.com> (cherry picked from commit e40d355)



Description
This is a trial-and-error workaround to avoid video stream freezing when display switching that somehow happens when not using the added thread sleeps due to a racecondition that needs to be further diagnosed and might even be outside Sunshine's control. Adding a 0.01ms sleep at the relevant codepoints (see changes) seems to fix the issue reliably without creating unwanted behaviour/other issues.
Also the log level for pipewire state changes to ease initial analysis of future issues without having debug logging enabled.
This issue was identified after working around the FFmpeg segfault in #4943 but might occur independently from that. Unsure if this can also help with #5241 but should be tested as it might be the same root cause.
Note: This issue is almost impossible to pinpoint as it is only occurring when running as a systemd user service without having debug logging enabled (so
BOOST_LOG(debug)does noop). If run with debug logging enabled, the added runtime by BOOST_LOG is sufficient to allievate the issue. Same is true if running with a debugger attached. The only way to figure out the relevant code segment was by changing the loglevel for each log statement inpipewire.cppand trying to trigger the issue. After doing that the following codesegment was found to fix the issue when using log level info:Sunshine/src/platform/linux/pipewire.cpp
Lines 866 to 869 in 1f0455c
That is why the workaround now adds a sleep right before the return (and does so also for the timeout case).
Screenshot
Issues Fixed or Closed
Roadmap Issues
Type of Change
Checklist
AI Usage